Prioritized Experience Replay
1 Overview
Prioritized experience replay was proposed by T. Schaul et al. ("Prioritized Experience Replay", ICLR 2016) and is widely used to speed up reinforcement learning.
Roughly speaking, transitions whose values are mis-predicted (i.e. those with large TD error) are sampled and learned more frequently. To compensate for the distorted sampling probability, the learning weight is scaled in the opposite direction (cf. importance sampling).
In cpprb, the PrioritizedReplayBuffer class implements these functionalities with proportional-based (instead of rank-based) priorities, as shown in the table below.
Quantity for the \(i\)-th transition | Definition |
---|---|
Probability: \( P(i) \) | \( \frac{(p_{i}+\epsilon)^{\alpha}}{\sum_{j} (p_{j}+\epsilon)^{\alpha}} \) |
Weight: \( w_i \) | \( \frac{\left( \frac{1}{N} \frac{1}{P(i)}\right)^{\beta}}{\left( \frac{1}{N} \frac{1}{\min_{j} P(j)}\right)^{\beta}} = \left(\frac{\min_{j} P(j)}{P(i)}\right)^{\beta} \) |
Here \( p_i \) is the stored priority of the \(i\)-th transition and \( N \) is the number of stored transitions.
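As a small numeric illustration of these formulas, here is a sketch with arbitrarily chosen toy priorities (plain NumPy, not cpprb code):

import numpy as np

alpha, beta, eps = 0.6, 0.4, 1e-4
p = np.array([0.1, 0.5, 1.0, 2.0])   # toy stored priorities, N = 4

# Sampling probabilities P(i)
P = (p + eps) ** alpha
P /= P.sum()                         # approximately [0.073, 0.193, 0.292, 0.442]

# Importance-sampling weights w_i, normalized by the largest weight
w = (1.0 / (len(p) * P)) ** beta
w /= w.max()                         # equivalent to (min_j P(j) / P(i)) ** beta
print(P, w)                          # w is approximately [1.00, 0.68, 0.58, 0.49]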
You can add priorities together with the other environment values. If no priorities are passed, the currently stored maximum priority is used.
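For example, a minimal sketch of both cases, assuming the priorities keyword of add as described above (the buffer layout here is only illustrative):

import numpy as np
from cpprb import PrioritizedReplayBuffer

rb = PrioritizedReplayBuffer(32, {"obs": {"shape": (4,)}, "rew": {}, "done": {}})

# Pass an explicit priority together with the environment values.
rb.add(obs=np.zeros(4), rew=1.0, done=0, priorities=0.8)

# No priority passed: the stored maximum priority is used.
rb.add(obs=np.ones(4), rew=0.0, done=0)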
The dict returned by sample also has the special keys indexes and weights. The indexes are intended to be passed to update_priorities to update the corresponding priorities after comparison with the new predictions. The weights can be used to scale errors when updating network parameters. Since the weights are normalized by the largest weight, as in the original research, the values are always less than or equal to 1.
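A minimal, self-contained sketch of that usage; the buffer layout and the random numbers standing in for TD errors are placeholders for a real model:

import numpy as np
from cpprb import PrioritizedReplayBuffer

prb = PrioritizedReplayBuffer(256, {"obs": {"shape": (4,)}, "rew": {}, "done": {}})
for _ in range(100):
    prb.add(obs=np.zeros(4), rew=0.0, done=0)

s = prb.sample(32, beta=0.4)

# Stand-in for the TD errors computed by your model on the sampled batch.
td_error = np.random.rand(32)

# Scale the per-sample errors by the importance-sampling weights (all <= 1).
loss = np.mean(s["weights"] * np.square(td_error))

# Feed the new absolute TD errors back as priorities for the sampled indices.
prb.update_priorities(s["indexes"], np.abs(td_error))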
PrioritizedReplayBuffer has the hyperparameters alpha (\( \alpha \)) and eps (\( \epsilon \)) in its constructor and beta (\( \beta \)) in the sample method. Their default values are 0.6, 1e-4, and 0.4, respectively. The details are described in the original paper cited above.
Parameter | Default | Description |
---|---|---|
alpha (\( \alpha \)) | \( 0.6 \) | Prioritization parameter. 0 means uniform sampling. |
beta (\( \beta \)) | \( 0.4 \) | Compensation parameter. 1 means full compensation (i.e. full importance-sampling correction). |
eps (\( \epsilon \)) | \( 10^{-4} \) | Small value to avoid zero priority. |
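The original paper anneals \( \beta \) from its initial value toward 1 over the course of training; a minimal sketch of such a linear schedule (the function name and step counts are only illustrative):

def beta_schedule(step, total_steps, beta0=0.4):
    """Linearly anneal beta from beta0 to 1 over training."""
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

# e.g. pass the annealed value at sampling time:
# s = prb.sample(batch_size, beta=beta_schedule(step, total_steps))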
2 Example Usage
import numpy as np
from cpprb import PrioritizedReplayBuffer

buffer_size = 256

# alpha controls how strongly priorities affect the sampling probability.
prb = PrioritizedReplayBuffer(buffer_size,
                              {"obs": {"shape": (4,4)},
                               "act": {"shape": 3},
                               "rew": {},
                               "next_obs": {"shape": (4,4)},
                               "done": {}},
                              alpha=0.5)

# Fill the buffer with (dummy) transitions. Since no priorities are
# passed, the stored maximum priority is used.
for i in range(1000):
    prb.add(obs=np.zeros((4,4)),
            act=np.ones(3),
            rew=0.5,
            next_obs=np.zeros((4,4)),
            done=0)

batch_size = 32

# beta controls how strongly the importance-sampling weights compensate.
s = prb.sample(batch_size, beta=0.5)

indexes = s["indexes"]
weights = s["weights"]

# train
# ...

# Update the priorities of the sampled transitions, e.g. with new TD errors.
new_priorities = np.ones_like(weights)
prb.update_priorities(indexes, new_priorities)
3 Notes
The constructor of PrioritizedReplayBuffer (i.e. __init__) has the parameter check_for_update. If the parameter is True (the default is False), the buffer tracks which indices have been overwritten since the last call to the sample() method, and the update_priorities() method updates priorities only at indices that have not changed. This feature is designed for multiprocess learning, in order to avoid mistakenly updating the priorities of transitions that have already been overwritten. (This protection is always enabled in MPPrioritizedReplayBuffer.)
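A minimal single-process sketch of the flag (the overwrite protection really matters in multiprocess settings; the buffer layout is only illustrative):

import numpy as np
from cpprb import PrioritizedReplayBuffer

prb = PrioritizedReplayBuffer(16,
                              {"obs": {"shape": (4,)}, "rew": {}, "done": {}},
                              check_for_update=True)

for _ in range(16):
    prb.add(obs=np.zeros(4), rew=0.0, done=0)

s = prb.sample(4)

# Transitions added after sample() overwrite old indices; with
# check_for_update=True, update_priorities() skips indices that changed.
for _ in range(8):
    prb.add(obs=np.ones(4), rew=1.0, done=0)

prb.update_priorities(s["indexes"], np.ones_like(s["weights"]))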
4 Technical Detail
To choose prioritized samples efficiently, partial sums and minima of the pre-calculated priorities are stored in a Segment Tree data structure, which is written in C++ and was an initial main motivation of this project.
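For intuition, here is a plain-Python sketch of how a sum Segment Tree supports sampling proportionally to priority via prefix-sum search in O(log N); cpprb's actual implementation is in C++ and additionally keeps a minimum tree for the weight normalization:

import random

class SumTree:
    """Minimal sum Segment Tree: O(log N) priority updates and prefix-sum search."""
    def __init__(self, capacity):
        # capacity is assumed to be a power of two for simplicity.
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)   # leaves live at [capacity, 2*capacity)

    def update(self, idx, priority):
        i = idx + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:                        # propagate the new sum up to the root
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self):
        s = random.uniform(0.0, self.tree[1])   # uniform point in the total mass
        i = 1
        while i < self.capacity:                # descend into the child covering s
            left = 2 * i
            if s <= self.tree[left]:
                i = left
            else:
                s -= self.tree[left]
                i = left + 1
        return i - self.capacity

tree = SumTree(8)
for idx, p in enumerate([0.1, 0.5, 0.2, 1.0, 0.3, 0.6, 0.7, 0.4]):
    tree.update(idx, p)
print(tree.sample())   # index drawn with probability proportional to its priority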
To support multiprocessing, the Segment Tree can also be updated lazily (see MPPrioritizedReplayBuffer).
Minibatch sampling is done by stratified sampling: the total priority mass is divided into equal-probability strata and one sample is drawn from each, which reduces the variance among minibatches.
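A sketch of that idea using plain cumulative sums instead of a Segment Tree:

import numpy as np

def stratified_sample(priorities, batch_size, rng=None):
    """Divide the total priority mass into batch_size equal strata and
    draw one prefix-sum target uniformly from each stratum."""
    rng = np.random.default_rng() if rng is None else rng
    cumsum = np.cumsum(priorities)
    stratum = cumsum[-1] / batch_size
    targets = (np.arange(batch_size) + rng.random(batch_size)) * stratum
    # Map each prefix-sum target to the transition index that covers it.
    return np.searchsorted(cumsum, targets)

priorities = np.array([0.1, 0.5, 0.2, 1.0, 0.3, 0.6, 0.7, 0.4])
print(stratified_sample(priorities, 4))   # one index drawn from each stratum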